Mixed models are used when you have some type of grouping variable within your data (e.g., departments).
They can also be used when you have a repeated measurement over the same person (e.g., testing).
Mixed models can also incorporate information from both grouping variables and repeated measurements (e.g., yearly job satisfaction across departments).
We will need the following:
install.packages(c("lme4", "lmerTest", "merTools", "MuMIn"))
While the specifics of each model you have learned to this point might take some time to get our heads all the way around, the terminology has largely been pretty clear – no more. You will hear “mixed models”, “mixed effects models”, “hierarchical linear models”, “nested models”, and/or “multilevel models”; these are all slight variations on a common theme. For the sake of our work here, we will keep it at mixed models. Within our mixed model, we have an additional source of cloudiness: fixed and random effects. The random effects don’t pose much of an issue (we will define it later), but fixed effects have 4 different definitions depending upon whom you ask (you will commonly see it used for panel or time-series data when people want to remove the variation of a group or individual). For the sake of simplicity (again), we are going to consider fixed effects as an effect on the individual unit of analysis. This will all start to make sense once we take a look at the models.
Let’s assume that we are going to try to predict sales for a group of stores:
\[sales = \beta_{intercept} + \beta_{rebate}*Rebate + \epsilon\]
The error term would assumed to be normally distributed with a mean of 0 and some standard deviation:
\[\epsilon \sim \mathscr{N}(0, \sigma) \]
With that, we can take those equations and shuffle them around a bit to capture the data generation process for sales:
\[sales \sim \mathscr{N}(\mu, \sigma)\]
\[\mu = \beta_{intercept} + \beta_{rebate}*Rebate\]
Now with that, we can add the effects for another variable at a higher level (i.e., the effects of store):
\[sales = \beta_{intercept} + \beta_{rebate}*Rebate + (effect_{store} + \epsilon)\]
And look at the error terms:
\[ effect_{store} \sim \mathscr{N}(0, \tau)\]
Where we can say that the effect of store is normally distributed with a mean of 0 and some standard deviation.
And rolling it all together, we get:
\[sales = (\beta_{intercept} + effect_{store}) + \beta_{rebate}*rebate + \epsilon)\]
As mentioned earlier, we will get improved standard errors.
Mixed models don’t get bogged down by large groups (i.e., large groups do not dominate the inference) and the smaller groups do not get buried by the larger groups.
Mixed models will not overfit or underfit when we have repeated samples/measurement. We also get improved estimates for repeated measures when dealing with mixed models.
Our models retain the variation inherent within the data when we used a mixed model.
That is a pretty clear effect for the number of visits and the total spent.
But let’s look a bit further:
This changes things entirely!
For the sake of conceptual grounding, let’s do some data prep and go back to our standard linear model:
library(data.table)
library(ggplot2)
performanceReview <- fread(
"http://www.nd.edu/~sberry5/data/performanceReviewData.csv"
)
lmTest <- lm(performanceScore ~ Manager + age,
data = performanceReview)
summary(lmTest)
Call:
lm(formula = performanceScore ~ Manager + age, data = performanceReview)
Residuals:
Min 1Q Median 3Q Max
-7.3230 -1.7861 0.0972 1.7638 6.5431
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 56.924348 0.373866 152.26 <2e-16 ***
Manager -5.132527 0.226195 -22.69 <2e-16 ***
age -0.356581 0.007252 -49.17 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.613 on 531 degrees of freedom
Multiple R-squared: 0.8453, Adjusted R-squared: 0.8447
F-statistic: 1451 on 2 and 531 DF, p-value: < 2.2e-16
ggplot(performanceReview,
aes(performanceScore, age, color = as.factor(Manager))) +
geom_point() +
theme_minimal()
We have our standard output here. As before, our intercept is the average score when everything is at 0, we have a categorical variable (Manager) with simple constrasts, and the continuous predictor of age.
Let’s see how some additional information changes our viz:
ggplot(performanceReview,
aes(performanceScore, age, color = as.factor(Manager))) +
geom_point() +
facet_wrap( ~ idBuilding) +
theme_minimal()
Let’s include the department variable in our model. We are not going to add it as another predictor, but we are going to include it as another level to our model. The lme4 package will make this very easy. We will start with an intercept-only model:
library(lme4)
intMod <- lmer(performanceScore ~ 1 + (1|idDepartment),
data = performanceReview)
Before we look at the summary for this model, let’s get an idea about what is happening in the syntax. The first part of our formula should look familiar – these are the global estimates (fixed effects) within our model and will behave exactly the same as our standard linear model. We are just specifying an intercept here.
The next part in the parentheses is how we denote our random effect. Whenever you see a 1 included in a formula interface, we can be pretty comfortable that it is in reference to a intercept. The | specifies a grouping. With that information, we might be able to guess that we are specifying a random intercept for each borough.
We should probably check out the summary:
summary(intMod)
Linear mixed model fit by REML ['lmerMod']
Formula: performanceScore ~ 1 + (1 | idDepartment)
Data: performanceReview
REML criterion at convergence: 3505.3
Scaled residuals:
Min 1Q Median 3Q Max
-2.22396 -0.71120 -0.03982 0.74784 2.24396
Random effects:
Groups Name Variance Std.Dev.
idDepartment (Intercept) 5.22 2.285
Residual 39.09 6.252
Number of obs: 534, groups: idDepartment, 26
Fixed effects:
Estimate Std. Error t value
(Intercept) 37.7234 0.5266 71.64
We don’t really get anything too interesting here for now. Let’s do explore the random effects components just a little bit. Let’s do the following with our variances:
\[\frac{\text{intercept variance}}{\text{intercept variance} + \text{residual variance}}\]
5.22 / (39.09 + 5.22)
[1] 0.1178064
This is what is known as the intraclass correlation (ICC). It ranges from 0 (no variance between clusters) to 1 (variance between clusters). It can be thought of as how much the proportion of variance in scores is accounted for by the department alone. Computationally the ICC is as follows:
\[\rho = \frac{\tau^2}{\tau^2 + \sigma^2} \]
Where \(\tau^2\) is the population variance between clusters and \(\sigma^2\) is the population variance within clusters.
We can compute \(\sigma^2\) as:
\[\sigma^2 = \frac{\Sigma(n_j - 1)S^2_j}{N-C}\]
Where \(S^2_j\) (the variance within the cluster) is defined as:
\[\frac{\Sigma(y_{ij}-\bar{y}_j)}{(n_j - 1)}\]
where
\(n_j\) = sample size for cluster j
N = total sample size
C = total number of clusters
The calculation of \(\tau^2\) is as follows:
\[\tau^2 = \hat{S}^2_B - \frac{\hat{\sigma}^2}{\tilde{n}}\]
where
\[\hat{S}^2_B = \frac{\Sigma n_j(\bar{y_j - \bar{y}})^2}{\tilde{n}(C-1)}\] where
\(\bar{y}_j\) = mean on response variable for cluster j
\(\bar{y}\) = overall mean on response variable
\[\tilde{n} = \frac{1}{C-1} \begin{bmatrix}N-\frac{\Sigma n^2_j}{N}\end{bmatrix}\]
We could do all of this work by hand and before running the model, but I’ll let you do that on your own time if you really feel the need to do so.
We can add a predictor variable to the model:
riMod <- lmer(performanceScore ~ Manager + age + (1|idDepartment),
data = performanceReview)
summary(riMod)
Linear mixed model fit by REML ['lmerMod']
Formula: performanceScore ~ Manager + age + (1 | idDepartment)
Data: performanceReview
REML criterion at convergence: 2002.6
Scaled residuals:
Min 1Q Median 3Q Max
-2.59312 -0.72445 -0.02044 0.71297 2.23267
Random effects:
Groups Name Variance Std.Dev.
idDepartment (Intercept) 4.899 2.213
Residual 2.028 1.424
Number of obs: 534, groups: idDepartment, 26
Fixed effects:
Estimate Std. Error t value
(Intercept) 56.931410 0.482595 117.97
Manager -5.270653 0.125466 -42.01
age -0.354830 0.004066 -87.28
Correlation of Fixed Effects:
(Intr) Managr
Manager -0.137
age -0.395 0.012
When groups are largely balanced, we would find that our coefficients would be the same (or very close to it).
What should almost always change is our standard errors – by integrating information about the groups, we are getting a better sense of how much uncertainty our model contains at the global average level.
We also see some additional information – this is for our random effects. The standard deviation is telling us how much the score moves around based upon department after getting the information from our fixed effects (i.e., the score can move around nearly 2 whole point from department alone).
We can get a good sense of the random effect estimates:
ranef(riMod)
$idDepartment
(Intercept)
1 1.06318571
2 1.07377298
3 0.22842693
4 3.25746712
5 3.50057630
6 4.26420843
7 1.67845671
8 0.75113541
9 -0.39817728
10 0.90764199
11 -1.09916574
12 -0.66117892
13 2.25956540
14 1.55198242
15 -3.13673779
16 -1.85889149
17 -4.37484768
18 -4.35611033
19 -1.29756268
20 0.81728901
21 -1.20309868
22 0.64450265
23 0.09077807
24 -0.79003136
25 -0.34199796
26 -2.57118923
with conditional variances for "idDepartment"
We can also use various bits of information in our output to create different R^2 values.
For the first level of the model, we can calculate the \(R^2\) as:
\[R^2_11 - \frac{\sigma^2_{M1} + \tau^2_{M1}}{\sigma^2_{M0} + \tau^2_{M0}}\]
We can pull that information from our two model outputs:
m1Sigma <- 4.899 # full model random effect intercept
m1Tau <- 2.028 # full model random effect residual
m0Sigma <- 5.22 # null model random effect intercept
m0Tau <- 39.09 # null model random effect residual
1 - ((m1Sigma + m1Tau) / (m0Sigma + m0Tau))
[1] 0.8436696
The fixed effects of manager and age account for nearly 85% of the variation within the scores.
The second level can be calculated as:
\[R^2_2 = 1 - \frac{\sigma^2_{M1} / B + \tau^2_{M1}}{\sigma^2_{M0} / B + \tau^2_{M0}}\]
We see that we added a B into the mix here and it is just the average size of the level 2 units (534 / 26).
level2Mean <- 534 / 26
r2Numerator <- m1Sigma / level2Mean + m1Tau
r2Denominator <- m0Sigma / level2Mean + m0Tau
1 - (r2Numerator / r2Denominator)
[1] 0.9423923
You can also get these values like this:
MuMIn::r.squaredGLMM(riMod)
R2m R2c
[1,] 0.8430864 0.9540581
Here we have two values: the marginal R2 (R2m) and the conditional R2 (R2c). You can think of the marginal values as the standard type of R2 – it is the variability explained by the fixed effects part of the model (it is what we have already done above). The conditional R2 is using both fixed and random effects. So in this case, we would see that we are accounting for about 11% of the variation in wave alone
Let’s take a look at the effects:
library(sjPlot)
plot_model(riMod, type = "re") +
theme_minimal()
For the plot above, the easiest way to think about it is that the values are effects for each individual random effect (i.e., each departments random intercept). Since we are just dealing with intercepts right now, they are the deviations from the fixed intercept. The intercept for the model is ~57 and the random effect for department 6 appears to be about 4.25. If we wanted to predict scores for someone from department 6, we would take -57 + 4.25 for the intercept portion of the model (the same would go for any random slopes in the model). In the end, it is showing how much the intercept shifts from department to department, and some department have a positive effect beyond the average and others have a negative effect (department 17, for instance, is well below the average).
We can also plot simulated random effect ranges for each of the random effect groups. We want to pay attention to those that are highlighted with black (i.e., the range does not cross the red line at 0).
library(merTools)
plotREsim(REsim(riMod), labs = TRUE)
In examining the plot, we can see a very broad range of department effects on scores, with most of them being significant (noted by the black dots).
Going back to the output, did you notice anything missing: p-values! Estimating p-values in a mixed model is exceedingly difficult because of varying group sizes, complete sample n, and how those relate to reference distributions. If you need something that will help, you can get confidence intervals in the same way that you would anything else:
confint(riMod)
2.5 % 97.5 %
.sig01 1.679106 2.9393080
.sigma 1.338259 1.5134553
(Intercept) 55.976723 57.8862542
Manager -5.516426 -5.0246447
age -0.362799 -0.3468642
Those confidence intervals are interpreted how they always are, but we see two things that don’t look familiar: .sig01 and .sigma. These are related to our random effects, where .sig01 relates to the random intercept
If you really want to see p-values, you can get them easily:
riModP <- lmerTest::lmer(performanceScore ~ Manager + age +
(1|idDepartment),
data = performanceReview)
summary(riModP)
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: performanceScore ~ Manager + age + (1 | idDepartment)
Data: performanceReview
REML criterion at convergence: 2002.6
Scaled residuals:
Min 1Q Median 3Q Max
-2.59312 -0.72445 -0.02044 0.71297 2.23267
Random effects:
Groups Name Variance Std.Dev.
idDepartment (Intercept) 4.899 2.213
Residual 2.028 1.424
Number of obs: 534, groups: idDepartment, 26
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 56.931410 0.482595 36.539055 117.97 <2e-16 ***
Manager -5.270653 0.125466 506.793426 -42.01 <2e-16 ***
age -0.354830 0.004066 507.492183 -87.28 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr) Managr
Manager -0.137
age -0.395 0.012
NOTE: I would never load the lmerTest package, but would attach with colons! If you load it, you will find that it masks things from lme4 that you don’t want to have masked (i.e., lmer) and they are not equivalent objects!
Getting predicted values from our mixed model is no different then getting them from any other model:
mixedPred <- predict(riMod)
slimPred <- predict(lmTest)
allPred <- cbind(actual = performanceReview$performanceScore,
mixed = mixedPred,
slim = slimPred)
head(allPred, 20)
actual mixed slim
1 40.31653 37.59082 36.58406
2 41.99031 41.55702 40.40570
3 38.42164 39.74343 38.74729
4 39.18218 39.08405 37.92053
5 38.11899 38.73618 37.57095
6 31.10169 29.71230 28.66668
7 31.15377 33.45472 32.42756
8 34.37665 36.32163 35.14449
9 34.16259 33.44172 32.41449
10 36.89173 34.56446 33.37865
11 33.55405 33.86593 32.84080
12 44.59988 42.11135 40.96276
13 36.72473 37.38279 36.37500
14 31.56297 32.80710 31.61262
15 42.04933 41.50969 40.35814
16 40.66954 39.73064 38.73443
17 51.75952 50.39556 49.28783
18 43.14972 43.28860 42.30994
19 48.41701 50.43617 49.32864
20 25.17246 27.30469 26.24720
While there were cases where the standard linear model did a slightly better job, our mixed model generally did better. Plotting these makes it even more clear:
plot(allPred[, "actual"], allPred[, "slim"])
plot(allPred[, "actual"], allPred[, "mixed"])
Let’s add some more information to our model. As we dive into our data, we will notice that we also have a building variable. We can add this additional grouping into our model:
clusterMod <- lmer(performanceScore ~ Manager + age +
(1|idDepartment) + (1|idBuilding),
data = performanceReview)
This is often called a cross-classified model.
summary(clusterMod)
Linear mixed model fit by REML ['lmerMod']
Formula:
performanceScore ~ Manager + age + (1 | idDepartment) + (1 |
idBuilding)
Data: performanceReview
REML criterion at convergence: 1986.7
Scaled residuals:
Min 1Q Median 3Q Max
-2.56000 -0.73269 -0.03636 0.71645 2.26861
Random effects:
Groups Name Variance Std.Dev.
idDepartment (Intercept) 1.296 1.138
idBuilding (Intercept) 4.024 2.006
Residual 2.028 1.424
Number of obs: 534, groups: idDepartment, 26; idBuilding, 8
Fixed effects:
Estimate Std. Error t value
(Intercept) 57.091966 0.777901 73.39
Manager -5.271945 0.125411 -42.04
age -0.354929 0.004059 -87.44
Correlation of Fixed Effects:
(Intr) Managr
Manager -0.085
age -0.247 0.012
plotREsim(REsim(clusterMod), labs = TRUE)
We should also see if our predictions improved:
clusterPredict <- predict(clusterMod)
allPred <- cbind(actual = performanceReview$performanceScore,
clustered = clusterPredict,
mixed = mixedPred,
slim = slimPred)
head(allPred, 20)
actual clustered mixed slim
1 40.31653 37.65219 37.59082 36.58406
2 41.99031 41.61932 41.55702 40.40570
3 38.42164 39.80540 39.74343 38.74729
4 39.18218 39.14566 39.08405 37.92053
5 38.11899 38.79769 38.73618 37.57095
6 31.10169 29.77148 29.71230 28.66668
7 31.15377 33.51494 33.45472 32.42756
8 34.37665 36.38247 36.32163 35.14449
9 34.16259 33.50193 33.44172 32.41449
10 36.89173 34.62481 34.56446 33.37865
11 33.55405 33.92626 33.86593 32.84080
12 44.59988 42.17380 42.11135 40.96276
13 36.72473 37.44410 37.38279 36.37500
14 31.56297 32.86696 32.80710 31.61262
15 42.04933 41.57198 41.50969 40.35814
16 40.66954 39.79261 39.73064 38.73443
17 51.75952 50.46033 50.39556 49.28783
18 43.14972 43.35156 43.28860 42.30994
19 48.41701 50.50094 50.43617 49.32864
20 25.17246 27.36320 27.30469 26.24720
For many observations, our predictions definitely go tighter, but there isn’t a big difference between our two mixed models.
plot(allPred[, "actual"], allPred[, "slim"])
plot(allPred[, "actual"], allPred[, "mixed"])
plot(allPred[, "actual"], allPred[, "clustered"])
Hierarchical models are a slight variation on the models that we have just seen. In these models, we have groups nested within other groups. We know that we have a department variable and a building variable. One source of inquiry would be to incorporate the nesting of this information.
The model set-up is just different enough to cause some potential confusion here. Within the parentheses, we have our intercept as before, but we are also saying that we are looking at the departments nested within buildings (it follows an A/B constructions).
hierMod <- lmer(performanceScore ~ Manager + age +
(1|idBuilding/idDepartment),
data = performanceReview)
summary(hierMod)
Linear mixed model fit by REML ['lmerMod']
Formula:
performanceScore ~ Manager + age + (1 | idBuilding/idDepartment)
Data: performanceReview
REML criterion at convergence: 1986.7
Scaled residuals:
Min 1Q Median 3Q Max
-2.56000 -0.73269 -0.03636 0.71645 2.26861
Random effects:
Groups Name Variance Std.Dev.
idDepartment:idBuilding (Intercept) 1.296 1.138
idBuilding (Intercept) 4.024 2.006
Residual 2.028 1.424
Number of obs: 534, groups:
idDepartment:idBuilding, 26; idBuilding, 8
Fixed effects:
Estimate Std. Error t value
(Intercept) 57.091966 0.777901 73.39
Manager -5.271945 0.125411 -42.04
age -0.354929 0.004059 -87.44
Correlation of Fixed Effects:
(Intr) Managr
Manager -0.085
age -0.247 0.012
We have our fixed effect for observation and we have individual intercepts for both idBuilding and department nested within building. Looking at the standard deviation of our random effects, we can see that the scores bounce around a little over 1 point from from department to department within a building.
We can also look at our effect ranges:
plotREsim(REsim(hierMod), labs = TRUE)
And checking our predictions again:
hierModPredict <- predict(hierMod)
allPred <- cbind(actual = performanceReview$performanceScore,
hierMod = hierModPredict,
clustered = clusterPredict,
mixed = mixedPred,
slim = slimPred)
head(allPred, 20)
actual hierMod clustered mixed slim
1 40.31653 37.65219 37.65219 37.59082 36.58406
2 41.99031 41.61932 41.61932 41.55702 40.40570
3 38.42164 39.80540 39.80540 39.74343 38.74729
4 39.18218 39.14566 39.14566 39.08405 37.92053
5 38.11899 38.79769 38.79769 38.73618 37.57095
6 31.10169 29.77148 29.77148 29.71230 28.66668
7 31.15377 33.51494 33.51494 33.45472 32.42756
8 34.37665 36.38247 36.38247 36.32163 35.14449
9 34.16259 33.50193 33.50193 33.44172 32.41449
10 36.89173 34.62481 34.62481 34.56446 33.37865
11 33.55405 33.92626 33.92626 33.86593 32.84080
12 44.59988 42.17380 42.17380 42.11135 40.96276
13 36.72473 37.44410 37.44410 37.38279 36.37500
14 31.56297 32.86696 32.86696 32.80710 31.61262
15 42.04933 41.57198 41.57198 41.50969 40.35814
16 40.66954 39.79261 39.79261 39.73064 38.73443
17 51.75952 50.46033 50.46033 50.39556 49.28783
18 43.14972 43.35156 43.35156 43.28860 42.30994
19 48.41701 50.50094 50.50094 50.43617 49.32864
20 25.17246 27.36320 27.36320 27.30469 26.24720
Still, not much of a difference bewteen our mixed effects models:
plot(allPred[, "actual"], allPred[, "slim"])
plot(allPred[, "actual"], allPred[, "mixed"])
plot(allPred[, "actual"], allPred[, "clustered"])
plot(allPred[, "actual"], allPred[, "hierMod"])
For these predictions, it really does not look like anything changed at all. You might not see massive changes in fit when moving from model to model. The more important question, though, is your model reflecting the reality of your data?
Now that we have seen random intercepts and hierarchical models, we can add one final piece: random slopes. In the following model, we will specify a random intercept (recall, it is the 1 within our parenthesis) and a random slope (we are prefixing our grouping variable with a slope of interest). Not only will this model allow the intercept to vary between groups, but it will also allow the slope to vary between groups.
randomSlopes <- lmer(performanceScore ~ Manager + age +
(age|idDepartment),
data = performanceReview)
summary(randomSlopes)
Linear mixed model fit by REML ['lmerMod']
Formula: performanceScore ~ Manager + age + (age | idDepartment)
Data: performanceReview
REML criterion at convergence: 2002.6
Scaled residuals:
Min 1Q Median 3Q Max
-2.59430 -0.72745 -0.02409 0.71186 2.22131
Random effects:
Groups Name Variance Std.Dev. Corr
idDepartment (Intercept) 5.057e+00 2.2487018
age 5.563e-07 0.0007459 -1.00
Residual 2.028e+00 1.4241169
Number of obs: 534, groups: idDepartment, 26
Fixed effects:
Estimate Std. Error t value
(Intercept) 56.933336 0.488854 116.46
Manager -5.271157 0.125458 -42.02
age -0.354869 0.004069 -87.22
Correlation of Fixed Effects:
(Intr) Managr
Manager -0.136
age -0.422 0.012
convergence code: 0
boundary (singular) fit: see ?isSingular
Nothing changes with regard to our fixed effects, but we get some added information in our random effects. The random intercept standard deviation for department is telling us the amount that the scores bounce around from place to place and the age variance is telling us how much variability there is within the slope between locations.
We can look at it graphically:
performanceReview$idDepartment <- as.factor(performanceReview$idDepartment)
ggplot(performanceReview, aes(performanceScore, age,
group = idDepartment, color = idDepartment)) +
geom_smooth(method = "lm", se = FALSE) +
theme_minimal()
When looking at this visualization, it becomes clear that there really is not too much difference in the slopes of age between departments here.
We can also fit that random effects model to our hierarchical model:
observationModSlope <- lmer(performanceScore ~ Manager + age +
(age|idBuilding/idDepartment),
data = performanceReview)
summary(observationModSlope)
Linear mixed model fit by REML ['lmerMod']
Formula:
performanceScore ~ Manager + age + (age | idBuilding/idDepartment)
Data: performanceReview
REML criterion at convergence: 1987.9
Scaled residuals:
Min 1Q Median 3Q Max
-2.5713 -0.7514 -0.0326 0.7007 2.2012
Random effects:
Groups Name Variance Std.Dev. Corr
idDepartment:idBuilding (Intercept) 2.440e+00 1.5621734
age 2.908e-05 0.0053929 -1.00
idBuilding (Intercept) 5.608e+00 2.3682083
age 5.576e-08 0.0002361 -0.93
Residual 2.003e+00 1.4154356
Number of obs: 534, groups:
idDepartment:idBuilding, 26; idBuilding, 8
Fixed effects:
Estimate Std. Error t value
(Intercept) 57.115452 0.920617 62.04
Manager -5.269283 0.124735 -42.24
age -0.355020 0.004181 -84.92
Correlation of Fixed Effects:
(Intr) Managr
Manager -0.071
age -0.300 0.012
convergence code: 0
boundary (singular) fit: see ?isSingular
If you don’t want to go the mixed model route, the natural thing to do would be to conduct a regression for each individual group. Here is what that might look like:
library(purrr)
indPredictions <- performanceReview %>%
split(., f = .$idDepartment) %>%
map_df(~ data.frame(predicted = predict(lm(performanceScore ~ Manager + age,
data = .x)),
score = .x$performanceScore))
plot(indPredictions$score, indPredictions$predicted)
If we continue to look at our data (and with some knowledge about how NYC does health inspections), we will see that restaurants are rated yearly – let’s use this information in our model. We won’t worry about distance anymore, because now we have a few competing hypotheses. We could imagine two different ways that the works: one in which a restaurant’s score improves as observations increase (it takes some time for the owner to get his staff fully up to speed) or one in which the score decreases as the observations increase (the “shine has worn off the penny”).
Let’s do a bit of data processing first.
library(dplyr)
healthData <- fread(
"https://www.nd.edu/~sberry5/data/healthViolationsDistances.csv"
)
healthData <- healthData[, `:=`(BORO = as.factor(BORO),
cuisine = as.factor(`CUISINE DESCRIPTION`),
distanceCentered = dohDistanceMeter -
mean(dohDistanceMeter))]
healthData$nameLocation <- paste(healthData$DBA,
healthData$BUILDING,
sep = "_")
healthData$`GRADE DATE` <- lubridate::mdy(healthData$`GRADE DATE`)
healthData <- setkey(healthData, nameLocation)
healthData <- healthData[order(`GRADE DATE`), observation := 1:.N,
by = nameLocation]
timeReviewed <- healthData[,
.(n = .N),
by = nameLocation][n > 10]
reviewedRest <- healthData[which(healthData$nameLocation %in%
timeReviewed$nameLocation), ]
Now we can fit a mixed model, with each location getting its own intercept.
observationModInt <- lmer(SCORE ~ observation +
(1|nameLocation),
data = reviewedRest)
In this model, we have a fixed effect for observation and we are allowing each location to have it’s own random intercept. We have essentially created a model that will deal with the repeated measures for each of the locations.
plottingData <- head(reviewedRest[order(nameLocation, observation)], 350)
ggplot(plottingData, aes(observation, SCORE, group = nameLocation)) +
geom_point() +
geom_smooth(method = "lm") +
facet_wrap( ~ nameLocation) +
theme_minimal()
summary(observationModInt)
Linear mixed model fit by REML ['lmerMod']
Formula: SCORE ~ observation + (1 | nameLocation)
Data: reviewedRest
REML criterion at convergence: 345661
Scaled residuals:
Min 1Q Median 3Q Max
-5.0067 -0.5765 -0.1311 0.4606 6.6361
Random effects:
Groups Name Variance Std.Dev.
nameLocation (Intercept) 29.46 5.428
Residual 125.82 11.217
Number of obs: 44622, groups: nameLocation, 1750
Fixed effects:
Estimate Std. Error t value
(Intercept) 13.948018 0.159359 87.53
observation 0.461932 0.005331 86.65
Correlation of Fixed Effects:
(Intr)
observation -0.455
Our fixed effect here would indicate that we have a slight increase in scores as our observations increase, but we can also see that scores will bounce around about 5 points on average by location alone.
29.46 / (29.46 + 125.82)
[1] 0.1897218
We see that the location alone accounts for a substantial chunk of the variance in health scores.
MuMIn::r.squaredGLMM(observationModInt)
R2m R2c
[1,] 0.1618007 0.3208427
observationModSlope <- lmer(SCORE ~ observation +
(1 + observation|nameLocation),
data = reviewedRest)
summary(observationModSlope)
Linear mixed model fit by REML ['lmerMod']
Formula: SCORE ~ observation + (1 + observation | nameLocation)
Data: reviewedRest
REML criterion at convergence: 342130.5
Scaled residuals:
Min 1Q Median 3Q Max
-5.3899 -0.5413 -0.0949 0.4448 7.6038
Random effects:
Groups Name Variance Std.Dev. Corr
nameLocation (Intercept) 40.3156 6.3495
observation 0.2547 0.5046 -0.56
Residual 109.6190 10.4699
Number of obs: 44622, groups: nameLocation, 1750
Fixed effects:
Estimate Std. Error t value
(Intercept) 12.42495 0.18418 67.46
observation 0.63182 0.01486 42.51
Correlation of Fixed Effects:
(Intr)
observation -0.648
convergence code: 0
Model failed to converge with max|grad| = 0.00375371 (tol = 0.002, component 1)
Model is nearly unidentifiable: very large eigenvalue
- Rescale variables?
The observation variance is telling us how much variability there is within the slope between locations
If we use the predicted values of our model, we can see what our results will look like over the observations:
reviewedRest$mixedPrediction <- predict(observationModSlope)
reviewedRest$lmPrediction <- predict(lm(SCORE ~ observation,
data = reviewedRest))
ggplot() +
geom_line(mapping = aes(observation, mixedPrediction, group = nameLocation),
data = reviewedRest, alpha = .25) +
geom_smooth(mapping = aes(x = observation, y = lmPrediction),
data = reviewedRest, method = "lm",
color = "red") +
theme_minimal()
Amazing! We can really see the varying intercepts for each restaurant and we can see how the slopes are completely different (some are similar, but we have a completely mixed bag of directions and magnitudes).
It looks like the standard regression line would do okay for many of our locations, but we can see that it would do poorly with many others. This should help to provide an illustration of just how flexible and powerful mixed model can be in your hands.
No matter what type of model you want to run, there is a mixed model extension for it. Any distribution can be used with the glmer function, nonlinear models can be fit with nlme from the nlme, and generalized additive models can be fit with gamm from mgcv.